2024 LLM Systems Paper Collection
Pre-Training
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Reducing Activation Recomputation in Large Transformer Models
- Optimized Network Architectures for Large Language Model Training with Billions of Parameters | MIT
- Carbon Emissions and Large Neural Network Training | Google, UCB
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates | SOSP 23
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
- Perseus: Removing Energy Bloat from Large Model Training
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | ByteDance
- DISTMM: Accelerating Distributed Multimodal Model Training | NSDI’ 24
- A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
- Pipeline Parallelism with Controllable Memory | Sea AI Lab
Serving
- Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline | NUS
- Efficiently Scaling Transformer Inference | MLSys’ 23
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- TurboTransformers: An Efficient GPU Serving System For Transformer Models
- MPCFormer: Fast, Performant, and Private Transformer Inference with MPC | ICLR’ 23
- POLCA: Power Oversubscription in LLM Cloud Providers | Microsoft
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | ICML’ 23
- AttMemo: Accelerating Self-Attention with Memoization on Big Memory Systems
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP’ 23 (a toy sketch of the paged KV-cache idea follows this list)
- Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys’ 23
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB’ 24
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation | Microsoft
- FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua
- DeepSpeed-MII: Model Implementations for Inference (MII) | Microsoft
- Punica: Multi-Tenant LoRA Serving
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | ASPLOS 23
- SpotServe: Serving Generative Large Language Models on Preemptible Instances | CMU
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- Fairness in Serving Large Language Models | OSDI’ 24
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- APIServe: Efficient API Support for Large-Language Model Inferencing
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- Optimizing LLM Queries in Relational Workloads | UCB
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | PKU
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | PKU
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | Umich
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
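Many of the serving papers above (vLLM's PagedAttention, vAttention, Infinite-LLM, AttentionStore) center on how the KV cache is laid out and shared in GPU memory. The sketch below is a minimal toy illustration of the block-table idea behind paged KV caches: each request's logical token slots map onto fixed-size physical blocks drawn from a shared pool, so per-request cache memory need not be contiguous or pre-reserved. All names, the block size, and the free-list allocator here are invented for illustration; this is not vLLM's implementation.

```python
# Toy sketch of a paged KV cache: each request's logical token slots are
# mapped through a per-request block table onto fixed-size physical blocks
# drawn from a shared free pool (illustrative only, not vLLM's code).

BLOCK_SIZE = 16  # tokens per physical block (made-up value)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request_id -> list of physical block ids
        self.lengths = {}        # request_id -> number of tokens stored

    def add_request(self, request_id: str) -> None:
        self.block_tables[request_id] = []
        self.lengths[request_id] = 0

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        pos = self.lengths[request_id]
        if pos % BLOCK_SIZE == 0:              # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller must preempt or swap")
            self.block_tables[request_id].append(self.free_blocks.pop())
        self.lengths[request_id] = pos + 1
        block = self.block_tables[request_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE

    def free_request(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]

if __name__ == "__main__":
    cache = PagedKVCache(num_physical_blocks=4)
    cache.add_request("req0")
    slots = [cache.append_token("req0") for _ in range(20)]  # spans 2 blocks
    print(slots[0], slots[16])   # physical blocks need not be contiguous
    cache.free_request("req0")
```

The indirection is the point: because physical blocks are not contiguous per request, fragmentation and over-allocation are avoided; the real systems add copy-on-write sharing, swapping/preemption, and custom attention kernels on top.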
Fine-tuning Systems
- Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters | ICS’ 24
Multi-Model Systems
- MOSEL: Inference Serving Using Dynamic Modality Selection
- DISTMM: Accelerating Distributed Multimodal Model Training | NSDI’ 24
Image Generation Systems
- Approximate Caching for Efficiently Serving Diffusion Models | Adobe Research
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | MIT
LLM for Systems
- Large Language Models for Compiler Optimization
- The Hitchhiker’s Guide to Program Analysis: A Journey with Large Language Models
- LLM-Assisted Code Cleaning For Training Accurate Code Generators | UCB
System Efficiency Optimization
- Fast Distributed Inference Serving for Large Language Models | PKU
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance | Stanford
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | ICML ES-FoMo Workshop 2023
- Inference with Reference: Lossless Acceleration of Large Language Models
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
- Knowledge-preserving Pruning for Pre-trained Language Models without Retraining | SNU
- Accelerating LLM Inference with Staged Speculative Decoding | ICML’ 23 (a toy draft-then-verify sketch follows this list)
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | CMU
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML’ 23
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | Harvard
- LLMCad: Fast and Scalable On-device Large Language Model Inference
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding | THU
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | Microsoft
- Ring Attention with Blockwise Transformers for Near-Infinite Context | UCB
- Learned Best-Effort LLM Serving | UCB
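A recurring idea in this section (Staged Speculative Decoding, SpecInfer) is draft-then-verify decoding: a cheap draft model proposes a few tokens, and the expensive target model checks them, keeping the longest agreeing prefix. The toy below shows only that control flow, with made-up stand-in "models" (plain Python callables) and greedy exact matching; real implementations verify the whole draft in a single batched forward pass of the target model and use rejection sampling over token distributions rather than string comparison.

```python
# Toy greedy speculative decoding: a cheap draft "model" proposes k tokens,
# the expensive target "model" checks each position, and the longest agreeing
# prefix is accepted plus one corrected token. The callables are stand-ins.
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[str]], str],   # expensive model: context -> next token
    draft_next: Callable[[List[str]], str],    # cheap model: context -> next token
    prompt: List[str],
    max_new_tokens: int = 12,
    k: int = 4,                                # draft length per round
) -> List[str]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: accept the draft prefix the target model agrees with,
        #    then append the target's own token at the first disagreement.
        for t in draft:
            expected = target_next(tokens)
            if t == expected:
                tokens.append(t)            # accepted draft token
            else:
                tokens.append(expected)     # target's correction ends the round
                break
    return tokens[: len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Stand-in "models": the target counts in words, the draft mostly agrees.
    seq = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
    target = lambda ctx: seq[(len(ctx) - 1) % len(seq)]
    draft = lambda ctx: "oops" if len(ctx) % 5 == 0 else seq[(len(ctx) - 1) % len(seq)]
    print(" ".join(speculative_decode(target, draft, ["zero"], max_new_tokens=8)))
```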
ML Systems
- INFaaS: Automated Model-less Inference Serving | ATC’ 21
- Alpa : Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI’ 22
- Pathways : Asynchronous Distributed Dataflow for ML | MLSys’ 22
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | ICML’ 22
- ZeRO-Offload: Democratizing Billion-Scale Model Training
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors | MobiSys ’22
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing | ATC’22
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access | Eurosys’23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud | NSDI’22
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- SHEPHERD: Serving DNNs in the Wild
- Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning
- AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs
- ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
- Channel Permutations for N:M Sparsity | MLSys’ 23
- Welder: Scheduling Deep Learning Memory Access via Tile-graph | OSDI’ 23
- Optimizing Dynamic Neural Networks with Brainstorm | OSDI’23
- ModelKeeper: Accelerating DNN Training via Automated Training Warmup | NSDI’23
- Breadth-First Pipeline Parallelism | MLSys’ 23
- MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms | OSDI’ 23
- Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters | OSDI’ 23
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning | OSDI’ 23
- BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
Survey Papers
- Efficient Large Language Models: A Survey
- Challenges and Applications of Large Language Models
- Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
LLM Benchmark / Leaderboard Traces
- LLM Energy Leaderboard | Umich
- LLM-Perf Leaderboard | HuggingFace
- Aviary Explorer | Anyscale
- Open LLM Leaderboard | HuggingFace
- HELM | Stanford
- LMSYS | UCB
- Towards Efficient and Reliable LLM Serving: A Real-World Workload Study
LLM Frameworks
- AutoGen: Enable Next-Gen Large Language Model Applications | Microsoft
- DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective | Microsoft
- TensorRT-LLM | Nvidia
- Accelerate | Hugging Face
- vLLM | UCB (a minimal offline-inference example follows this list)
- Ray-LLM | Ray
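For a concrete entry point into the frameworks above, the snippet below sketches vLLM's offline-inference Python API. It assumes `vllm` is installed with a compatible GPU, and the model checkpoint is only an example; see the vLLM documentation for the authoritative usage.

```python
# Minimal vLLM offline-inference example (assumes `pip install vllm` and a GPU;
# the model checkpoint below is only an example).
from vllm import LLM, SamplingParams

prompts = [
    "Explain paged KV caches in one sentence.",
    "What is speculative decoding?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")          # any HF-compatible checkpoint
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}\n")
```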
MLSys Courses
- Systems for Machine Learning | [Stanford](https://cs229s.stanford.edu/fall2023/)
- Systems for Generative AI | [Umich](https://github.com/mosharaf/eecs598/tree/w24-genai)
- Systems for AI - LLMs | [GT](https://cs8803-sp24.anand-iyer.com/)
Other Lists
- A curated list of Large Language Model | Hannibal046/Awesome-LLM (github.com)
- AI systems paper list | lambda7xx/awesome-AI-system (github.com)
- A baseline repository of Auto-Parallelism in Training Neural Networks | ConnollyLeon/awesome-Auto-Parallelism (github.com)
- Numbers every LLM Developer should know | ray-project/llm-numbers (github.com)